Exploratory Analysis of the actor table

Filter unwanted columns

According to the wiki page, the following columns can be dropped:

  • standard_text_property
  • count_text_property
  • concat_names
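As a minimal sketch of this step, assuming the table is held in a pandas DataFrame: the variable name `actor` and the sample values are illustrative only and are not taken from the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the raw actor table; the real data is loaded elsewhere.
actor = pd.DataFrame({
    "pk_actor": [48893, 40238],
    "concat_standard_name": ["Reginaldo - da Genova", "Jacquet, Jean Bernardin"],
    "standard_text_property": [None, None],
    "count_text_property": [0, 0],
    "concat_names": ["", ""],
})

# Columns listed above as safe to drop.
UNWANTED_COLUMNS = ["standard_text_property", "count_text_property", "concat_names"]

# errors="ignore" keeps the call safe if a column is already absent.
actor = actor.drop(columns=UNWANTED_COLUMNS, errors="ignore")
print(list(actor.columns))
```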

Table extract

| | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14521 | 48893 | Actr48893 | Reginaldo - da Genova | 1510.0 | 3 | None | NaN | None | None | 1 | None | 104.0 | 30.0 | 2014-04-05 17:30:35.800 | 30.0 | 2014-04-05 17:30:36 |
| 36706 | 40238 | Actr40238 | Jacquet, Jean Bernardin | 1831.0 | 1 | None | 1881.0 | 1 | None | 1 | None | 104.0 | 24.0 | 2010-11-18 11:09:01.000 | 24.0 | 2013-12-18 15:24:16 |
| 38719 | 43209 | Actr43209 | Albinus, Elisabeth | 1595.0 | 1 | 2 | 1666.0 | 1 | 2 | 2 | None | 104.0 | 25.0 | 2011-05-26 11:53:19.000 | 25.0 | 2013-12-18 15:24:16 |
| 1025 | 14468 | Actr14468 | Denix, Guillaume | 1673.0 | 1 | None | 1673.0 | 1 | None | 1 | None | 104.0 | 28.0 | 2008-12-04 16:35:47.000 | 11.0 | 2013-12-18 15:35:49 |
| 1292 | 14720 | Actr14720 | Frundin, Anne | 1673.0 | 1 | None | NaN | 1 | None | 2 | None | 104.0 | 28.0 | 2008-12-04 16:35:49.000 | 11.0 | 2013-12-18 15:35:49 |

Filter only wanted rows

Some rows have been flagged as not to be imported. They are identified by the string "[à identifier]" in the concat_standard_name column.

Number of rows before filter: 61556
Number of rows after filter: 59625 (1931 have been removed)
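A sketch of how such a filter could be written with pandas. The DataFrame name `actor` and the three sample rows are made up for illustration, and matching the marker anywhere in the name (rather than only at the start) is an assumption.

```python
import pandas as pd

# Hypothetical sample; only the column name and the marker string come from the text.
actor = pd.DataFrame({
    "pk_actor": [1, 2, 3],
    "concat_standard_name": ["Denix, Guillaume", "[à identifier] inconnu", "Frundin, Anne"],
})

MARKER = "[à identifier]"

rows_before = len(actor)
# regex=False makes the square brackets literal; na=False treats missing names as non-matching.
flagged = actor["concat_standard_name"].str.contains(MARKER, regex=False, na=False)
actor = actor[~flagged]
rows_after = len(actor)

print(f"Number of rows before filter: {rows_before}")
print(f"Number of rows after filter: {rows_after} ({rows_before - rows_after} removed)")
```

Passing `regex=False` matters here: with the default regular-expression matching, `[à identifier]` would be read as a character class and would match almost every name.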

Filter by Actor type

For now, we are interested only in persons.

Persons are the rows whose fk_abob_type_actor column equals 104.

Number of actors whose type is not 104: 3

| | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10340 | 59031 | Actr59031 | Forster, James | 1830.0 | 3 | 3 | 1930.0 | 3 | 3 | 1 | None | 106.0 | 81.0 | 2016-11-29 11:05:00.060 | 81.0 | 2016-11-29 11:05:00 |
| 28956 | 60660 | Actr60660 | Valjean, Jean | 1769.0 | 1 | None | 1833.0 | 1 | None | 1 | None | 106.0 | 122.0 | 2018-10-23 16:48:50.050 | 122.0 | 2018-10-23 16:48:50 |
| 46023 | 46914 | Actr46914 | Dieu (conception chrétienne) | NaN | 1 | None | NaN | None | None | 0 | None | 106.0 | 3.0 | 2013-07-04 11:43:15.990 | 3.0 | 2013-12-18 15:24:16 |
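The selection of persons might look like the sketch below. The names `actor`, `persons` and `not_persons` are illustrative, and the sample rows keep only the two columns needed for the filter.

```python
import pandas as pd

PERSON_TYPE = 104  # value of fk_abob_type_actor that marks a person, per the text

# Hypothetical reduced sample of the actor table.
actor = pd.DataFrame({
    "pk_actor": [14468, 14720, 46914],
    "fk_abob_type_actor": [104.0, 104.0, 106.0],
})

is_person = actor["fk_abob_type_actor"] == PERSON_TYPE
not_persons = actor[~is_person]  # kept aside for inspection before discarding
persons = actor[is_person]

print(f"Number of actors whose type is not {PERSON_TYPE}: {len(not_persons)}")
```

With this comparison, a row whose type is missing (NaN) is treated as a non-person, which may or may not be the desired behaviour.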

Discovery

Columns contain:
Total number of rows: 59622
  -             "pk_actor":   0.00% empty - 59622 (100.00%) uniques (eg: 44895; 47015)
  -          "concat_actr":   0.00% empty - 59622 (100.00%) uniques (eg: Actr44895; Actr47015)
  - "concat_standard_name":   0.00% empty - 56635 ( 94.99%) uniques (eg: Sainte-Mar...; Costantino...)
  -        "creation_time":   0.00% empty - 34508 ( 57.88%) uniques (eg: 2012-04-08...; 2013-07-26...)
  -    "modification_time":   0.00% empty - 14053 ( 23.57%) uniques (eg: 2013-12-18...; 2016-10-21...)
  -              "creator":   0.01% empty -    88 (  0.15%) uniques (eg: 43.0; 30.0)
  -           "gender_iso":   0.04% empty -     4 (  0.01%) uniques (eg: 1; 2)
  -             "modifier":   8.90% empty -    85 (  0.14%) uniques (eg: 2.0; 30.0)
  -      "certainty_begin":   9.40% empty -     4 (  0.01%) uniques (eg: 3; 1)
  -        "certainty_end":  14.47% empty -     5 (  0.01%) uniques (eg: 3; None)
  -           "begin_year":  18.58% empty -   848 (  1.42%) uniques (eg: 1870.0; 1506.0)
  -             "end_year":  50.68% empty -   819 (  1.37%) uniques (eg: 1930.0; 1545.0)
  -          "notes_begin":  67.71% empty -     5 (  0.01%) uniques (eg: 3; 2)
  -            "notes_end":  72.40% empty -     6 (  0.01%) uniques (eg: 3; 4)
  -                "notes":  89.83% empty -  6031 ( 10.12%) uniques (eg: <p>Il s'ag...; None)
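A per-column summary of this kind can be computed roughly as follows. The helper name `summarize_columns` is hypothetical, and the sketch assumes that "empty" means a missing value (NaN or None) and that missing values are not counted among the uniques. The original report may have used different conventions.

```python
import pandas as pd

def summarize_columns(df):
    """Return, per column, the share of missing values and the number of distinct values."""
    total = len(df)
    summary = pd.DataFrame({
        "empty_pct": df.isna().mean() * 100,    # percentage of missing cells
        "uniques": df.nunique(dropna=True),     # distinct non-missing values
    })
    summary["uniques_pct"] = summary["uniques"] / total * 100
    # Least-empty columns first, as in the listing above.
    return summary.sort_values("empty_pct")

# Made-up sample to show the shape of the result.
sample = pd.DataFrame({
    "pk_actor": [1, 2, 3, 4],
    "gender_iso": [1, 2, 1, None],
    "notes": [None, None, None, "x"],
})
summary = summarize_columns(sample)
print(summary)
```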

Type parsing

Based on the summary above, we parse each column into its most meaningful type.
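The text does not say which type each column receives, so the mapping below is only an assumption: years and user identifiers become nullable integers, and timestamps become datetimes. The DataFrame name `actor` and the sample values are illustrative.

```python
import pandas as pd

# Hypothetical reduced sample using values of the same shape as the table extract.
actor = pd.DataFrame({
    "pk_actor": [48893, 40238],
    "begin_year": [1510.0, 1831.0],
    "end_year": [float("nan"), 1881.0],
    "creator": [30.0, 24.0],
    "creation_time": ["2014-04-05 17:30:35.800", "2010-11-18 11:09:01.000"],
})

# The nullable "Int64" dtype keeps missing years as <NA> instead of forcing floats.
for column in ["begin_year", "end_year", "creator"]:
    actor[column] = actor[column].astype("Int64")

# Timestamps stored as text are parsed into datetime64 values.
actor["creation_time"] = pd.to_datetime(actor["creation_time"])

print(actor.dtypes)
```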

Columns analysis

Here we report interesting findings from the analysis of individual columns. The analysis is not exhaustive.

For some columns, we also update the values.

gender_iso

Some gender values are undefined. According to the ISO standard (ISO/IEC 5218), the value should be 0 (not known), 1 (male), 2 (female) or 9 (not applicable), so we replace undefined genders with 0.
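A sketch of this replacement, assuming "undefined" covers both missing cells and the literal text "None" (an assumption based on how such values appear in the table extract). The DataFrame name `actor` is illustrative.

```python
import pandas as pd

VALID_GENDERS = {0, 1, 2, 9}  # codes allowed by ISO/IEC 5218

# Made-up sample mixing valid codes, a missing cell and a textual "None".
actor = pd.DataFrame({"gender_iso": [1, 2, None, "None"]})

# Anything that cannot be read as a number becomes NaN, then NaN becomes 0 (not known).
gender = pd.to_numeric(actor["gender_iso"], errors="coerce")
actor["gender_iso"] = gender.fillna(0).astype(int)

# Sanity check: every remaining value is one of the ISO codes.
all_valid = bool(actor["gender_iso"].isin(VALID_GENDERS).all())
print(actor["gender_iso"].tolist(), all_valid)
```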

certainty_begin

We replace missing values with 0.
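A minimal sketch of the replacement, assuming missing values are stored as NaN. The same loop also covers certainty_end, which receives the same treatment further down. The DataFrame name `actor` and the sample values are illustrative.

```python
import pandas as pd

# Made-up sample with one missing value in each certainty column.
actor = pd.DataFrame({
    "certainty_begin": [3, 1, None],
    "certainty_end": [None, 1, 2],
})

for column in ["certainty_begin", "certainty_end"]:
    # Missing values become 0; the column can then be stored as plain integers.
    actor[column] = actor[column].fillna(0).astype(int)

print(actor)
```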

begin_year

certainty_end

We replace missing values with 0.

end_year

creation_time

creator

notes

All HTML tags, non-ASCII characters and newlines are removed.
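One possible way to perform this cleaning uses only the standard library. The function name `clean_note` is hypothetical, the regular expression is a simplification that handles plain tags rather than arbitrary HTML, and replacing each newline with a space (rather than deleting it outright) is a choice made here so that adjacent words do not run together.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")           # simple HTML tags such as <p> or </p>
NON_ASCII_PATTERN = re.compile(r"[^\x00-\x7F]")  # any character outside the ASCII range
NEWLINE_PATTERN = re.compile(r"[\r\n]+")         # one or more line breaks

def clean_note(note):
    """Strip HTML tags, non-ASCII characters and line breaks from a note."""
    if note is None:
        return None
    text = TAG_PATTERN.sub("", note)
    text = NON_ASCII_PATTERN.sub("", text)
    text = NEWLINE_PATTERN.sub(" ", text)  # assumption: a space replaces each line break
    return text.strip()

# Made-up note, not taken from the data.
cleaned = clean_note("<p>Born in\nGenève.</p>")
print(cleaned)
```

Dropping non-ASCII characters also drops accented letters, so "Genève" becomes "Genve" in this example.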